This practice is part of the subject Biomedical Data Science of the Degree in Data Science of the Universitat Politècnica de València, taught by the Department of Applied Physics.
The measurement of data quality (DQ) dimensions is the central axis for the evaluation and improvement of data quality, as well as for its correct and optimal use. DQ dimensions are individual aspects or constructs that represent data quality attributes. Each dimension can be associated with one or more metrics, quantified with specific methods, and can also be assessed with exploratory methods.
This practice is intended to provide an initial basis for the evaluation of DQ metrics. It will consist of the application of a series of methods for different dimensions of DQ. In the context of the maternal and child clinical setting, we will analyze a data file whose ultimate purpose is the monitoring of care indicators in this setting. Depending on the dimension, we will apply the methods and calculate the metrics both in general for the whole dataset and monitored over batches of time (by months), simulating the results of a DQ monitoring and continuous improvement system.
In some parts of the code we will find the text ##TODO## that we will need to complete. Additionally, we will have to discuss the results at those points where it is indicated. The deliverable of the practice will consist of knitting this R Markdown file to HTML, using Knit, where the results of the execution and the figures will be shown, with the ##TODO## sections completed and the results commented.
We check that the working directory is the one containing the practice file and the data folder:
getwd()
Otherwise, we set it (change the example directory to ours):
setwd(".")
We install the required libraries and then load them into the working environment.
# install.packages("zoo", repos = "http://cran.us.r-project.org")
# install.packages("rts", repos = "http://cran.us.r-project.org")
# install.packages("plotly", repos = "http://cran.us.r-project.org")
# install.packages("devtools", repos = "http://cran.us.r-project.org")
# library("devtools")
# devtools::install_github('c5sire/datacheck')
# devtools::install_github("hms-dbmi/EHRtemporalVariability")
library("shiny")
library("zoo")
library("rts")
library("plotly")
library("datacheck")
library("EHRtemporalVariability")
We set the initial parameters of the data. The main date of the records, which will be used for the purpose of monitoring the delivery care indicators, is the date of birth.
# File name
fileName = "data/DQIinfantFeeding.csv"
# Whether it has a header or not
hasHeader = TRUE
# Main date column to be used for monitoring purposes
dateColumn = "BIRTH_DATE"
# Format of the previous date
dateFormat = '%d/%m/%Y'
# Text string that represents missing data in the file
missingValue = "NaN"
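As a quick sanity check of the declared date format, we can try to parse a hypothetical example date (the date shown is an illustrative assumption, not taken from the file); an NA result would signal a format mismatch:

```r
# Parse a hypothetical example date with the declared format;
# NA would indicate that dateFormat does not match the data.
parsed <- as.Date("25/12/2013", format = dateFormat)
format(parsed, "%Y-%m-%d")   # "2013-12-25"
```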
We load the file data/DQIinfantFeeding.csv in a data.frame named repository:
repository <- read.csv2(fileName, header=hasHeader, na.strings=missingValue)
# We collect the number of rows and columns
N <- nrow(repository)
D <- ncol(repository)
For monitoring purposes, we will use the zoo library (S3 Infrastructure for Regular and Irregular Time Series - Z’s Ordered Observations) to convert the data, the data.frame, to a format suited for batch analyses, the zoo format.
zooRepository <- read.zoo(repository,format = dateFormat,index.column = dateColumn)
One of the main uses of the maternal and infant data repository studied is the monitoring of quality of care indicators. In the field of newborn feeding, one of the most important indicators is whether there has been early initiation of breastfeeding in the delivery room. To calculate this indicator, we create the following function that will obtain the indicator for each batch of data received, so that we can apply it repeatedly for each batch given a frequency.
indicatorEBF_delroom <- function(dataset){
  # Numerator: early initiation during the 1st hour, in the delivery room,
  # for vaginal or instrumental deliveries
  numerator = (dataset$NEO_MOMFBF_TYPE %in% 'During the 1st hour') &
    (dataset$NEO_PFBF_TYPE %in% 'Delivery room') &
    (dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))
  # Denominator: all evaluable cases (breastfeeding not undesired,
  # vaginal or instrumental deliveries)
  denominator = (dataset$NEO_MOMFBF_TYPE %in% c('During the 1st hour', 'During the 2nd hour', 'During the 3rd hour', 'Breastfeeding does not start')) &
    (dataset$NEO_PFBF_TYPE %in% c('Delivery room', 'Hospitalization unit', 'Breastfeeding does not start')) &
    !(dataset$NEO_FBFEVAL_TY %in% 'Undesired breastfeeding') &
    (dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))
  indicator = sum(numerator)/sum(denominator) * 100
  return(indicator)
}
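As a quick check that the function behaves as expected, we can apply it to a small hypothetical batch (the NEO_FBFEVAL_TY values other than 'Undesired breastfeeding' are illustrative assumptions): rows 1 and 3 count for both numerator and denominator, row 2 only for the denominator, and row 4 for neither, so the expected indicator is 2/3 ≈ 66.67%.

```r
# Hypothetical 4-record batch to sanity-check indicatorEBF_delroom
toyBatch <- data.frame(
  NEO_MOMFBF_TYPE = c("During the 1st hour", "During the 2nd hour",
                      "During the 1st hour", "Breastfeeding does not start"),
  NEO_PFBF_TYPE   = c("Delivery room", "Hospitalization unit",
                      "Delivery room", "Breastfeeding does not start"),
  NEO_FBFEVAL_TY  = c("Normal", "Normal", "Normal", "Undesired breastfeeding"),
  DEL_MODE        = c("Vaginal delivery", "Vaginal delivery",
                      "Forceps delivery", "Vaginal delivery"),
  stringsAsFactors = FALSE
)
indicatorEBF_delroom(toyBatch)   # 2 numerator cases / 3 denominator cases ≈ 66.67
```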
Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts library (eXtensible Time Series). In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:
resIndicatorEBF_delroom = apply.monthly(zooRepository, FUN = indicatorEBF_delroom)
plot(resIndicatorEBF_delroom, xlab = "Date", ylab = "%",
     main = "Early breastfeeding start in the delivery room", ylim = c(0, 100))
The graph above depicts the time trend of early initiation of breastfeeding in the delivery room from 2010 to 2014. Some highlights and key observations are detailed below:
General trend. The graph shows a high percentage of early initiation of lactation in the delivery room, with values ranging mostly between 80% and 100%. This indicates that, in most cases, newborns started breastfeeding within the first hour and in the delivery room environment.
Time variability. Although the percentage remains high, there is a slight month-to-month variability, which could reflect seasonal changes, clinical practices, or even differences in on-duty staffing or medical team responsiveness.
Absence of data. Observing the graph, it can be seen that information is missing in one segment, suggesting the absence of data in that period. This lack of data may be due to several reasons.
The lack of data in one section of the chart limits the continuous interpretation of the indicator over time, and can make it difficult to identify trends and patterns. As the gap is significant, it would be advisable to investigate the exact cause to determine whether the overall pattern of early breastfeeding initiation could be affected by this absence of information.
We will find the missing data in the repository and calculate the corresponding metrics. First, for each variable:
# Matrix marking observed (non-missing) values
NAmatrix <- !is.na(repository)
# Number of observed values per column
sumNAmatrix <- apply(NAmatrix, 2, sum)
completenessByColumn <- round(sumNAmatrix/N*100, 2)
completenessByColumn
## PATIENT_ID DEL_TYPE_A DEL_MODE DEL_TYPE_ANE DEL_IN_INTRACA
## 100.00 100.00 100.00 100.00 100.00
## BIRTH_DATE NEO_ES2SC_TYPE NEO_ES2SCP_TYPE NEO_MOMFBF_TYPE NEO_PFBF_TYPE
## 100.00 100.00 100.00 100.00 100.00
## NEO_FBFEVAL_TY NEO_APGAR_NUM NEO_WEIGHT_NUM NEO_PHCORD_TXT NEO_BF_DIS
## 100.00 31.31 95.42 100.00 100.00
Next, we will calculate and display the overall percentage of missing data:
totalNonMissing <- sum(NAmatrix)
totalValues <- N * D
completenessByDataset <- round((totalNonMissing / totalValues) * 100, 2)
completenessByDataset
## [1] 95.12
In general, the data set presents a high degree of completeness, with 95.12% of data not missing. This suggests that, in general terms, most of the information needed for the analysis is available and the dataset is robust to carry out reliable studies.
When analyzing the situation by the different variables, we observe that several of them are 100% complete. These include PATIENT_ID, DEL_TYPE_A, DEL_MODE, DEL_TYPE_ANE, DEL_IN_INTRACA, BIRTH_DATE, NEO_ES2SC_TYPE, NEO_ES2SCP_TYPE, NEO_MOMFBF_TYPE, NEO_PFBF_TYPE, NEO_FBFEVAL_TY, NEO_PHCORD_TXT and NEO_BF_DIS.
However, not all variables present the same level of completeness. Specifically, the variable NEO_WEIGHT_NUM has a completeness of 95.42%, which means that there is a small percentage of missing data. Although this percentage is relatively low, birth weight is a critical indicator in neonatal studies, so it would be convenient to review the reasons behind these missing data and consider strategies to manage them, such as data imputation or sensitivity analysis.
More worrisome is the case of the NEO_APGAR_NUM variable, which presents a completeness of 31.31%. This implies that about 68.69% of the data are missing for this variable. Given that the Apgar score is a fundamental measure for assessing the immediate physical condition of the newborn, the absence of this information in such a high proportion of cases could limit the ability to perform complete and accurate analyses in this setting.
To monitor the completeness by temporal batches, we will create a function that performs this calculation for each batch it receives as a parameter and returns the completeness for each column: the function dimCompletenessByColumn.
dimCompletenessByColumn <- function(repository){
  N = dim(repository)[1]
  # Matrix of observed (non-missing) values
  NAmatrix <- !is.na(repository)
  sumNAmatrix <- apply(NAmatrix, 2, sum)
  completenessByColumn <- round(sumNAmatrix/N*100, 2)
  return(completenessByColumn)
}
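A quick check of the function on a small hypothetical data frame with known gaps (two columns, four rows, with one and two missing values respectively):

```r
# Hypothetical mini-repository: colA has 1 NA (75%), colB has 2 NAs (50%)
toyRepo <- data.frame(colA = c(1, NA, 3, 4),
                      colB = c(NA, NA, 7, 8))
dimCompletenessByColumn(toyRepo)   # colA: 75, colB: 50
```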
Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts library (eXtensible Time Series). In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:
resCompletenessByColumn = apply.monthly(zooRepository, dimCompletenessByColumn)
Now, we can create a plot with the results using the plotly library (Create Interactive Web Graphics via ‘plotly.js’). First for each variable:
p <-
plot_ly(
x = index(resCompletenessByColumn),
y = resCompletenessByColumn[, 1],
name = names(resCompletenessByColumn)[1],
type = 'scatter',
mode = 'lines'
) %>%
plotly::layout(
title = 'Completeness by month',
xaxis = list(title = "Date"),
yaxis = list(title = "Completeness (%)")
)
for (i in 2:ncol(resCompletenessByColumn)) {
p = p %>% plotly::add_trace(y = resCompletenessByColumn[, i],
name = names(resCompletenessByColumn)[i],
mode = 'lines')
}
div(p, align = "center")
And second, globally: the result can be calculated from the variable resCompletenessByColumn, reusing the single-series plotting code from the previous chunk:
globalCompletenessByMonth <- rowMeans(resCompletenessByColumn, na.rm = TRUE)
p <- plot_ly(
x = index(resCompletenessByColumn),
y = globalCompletenessByMonth,
name = "Global Completeness",
type = 'scatter',
mode = 'lines'
) %>%
plotly::layout(
title = 'Global Completeness by Month',
xaxis = list(title = "Date"),
yaxis = list(title = "Completeness (%)")
)
div(p, align = "center")
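Note that taking rowMeans of the per-column percentages coincides with the cell-level completeness rate of each batch, since every column within a batch has the same number of rows. A toy check of this equivalence (hypothetical values):

```r
# Toy check: mean of per-column completeness equals cell-level completeness
m <- matrix(c(1, NA, 3, 4,      # column 1: 3/4 observed -> 75%
              5, 6, NA, NA),    # column 2: 2/4 observed -> 50%
            ncol = 2)
byCol  <- colSums(!is.na(m)) / nrow(m) * 100   # 75, 50
byCell <- sum(!is.na(m)) / length(m) * 100     # 62.5
mean(byCol) == byCell                          # TRUE: equal column sizes make both rates match
```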
The first graph shows the evolution of the completeness of each variable over time, allowing us to observe specific patterns in the availability of data on a monthly basis. The vast majority of variables maintain 100% completeness consistently throughout the available period. However, two variables, NEO_APGAR_NUM and NEO_WEIGHT_NUM, show greater variability. Towards the end of the period, especially in 2014, there is a notable decrease in the completeness of both variables, which could indicate a change in the data collection process or specific difficulties in capturing information during those months.
As for the second graph, it shows the overall completeness of the dataset by month, aggregating all variables to calculate an overall completeness rate. In general terms, the overall completeness rate remains high, hovering between 93% and 97%, but a gradual decrease is observed, especially from 2013 onwards, which becomes more pronounced towards the end of the observed period. This overall decline suggests a possible systemic problem affecting the data collection processes or the completeness of the records themselves. The significant drop at the end of 2014 indicates a notable event or change in data recording practices that may require further investigation to identify its exact cause and, if necessary, take steps to improve data quality in the future.
We are going to analyze two multivariate consistency rules in the data. For this we will use the datacheck library (Tools for Checking Data Consistency), which allows us to write logical rules in R language, using the variable names of our data. These rules will be written in the attached file rules.R, the first one is provided as an example, and the second one should be coded based on the provided natural language expression rule.
# We read the rules file
rules = read_rules("rules.R")
# We evaluate the rules on our data
profile = datadict_profile(repository, rules)
# We show the error count for each rule
knitr::kable(profile$checks[,c(1,3,6)])
| Variable | Rule | Error.sum |
|---|---|---|
| DEL_MODE | !(sapply(DEL_MODE, tolower) %in% "elective caesarea") \| (sapply(DEL_TYPE_ANE, tolower) %in% c("epidural anesthesia", "epidural anesthesia / general anesthesia", "spinal anesthesia", "spinal anesthesia / general anesthesia", "general anesthesia")) | 220 |
| DEL_MODE | !(sapply(DEL_MODE, tolower) %in% "emergency caesarea") \| (sapply(DEL_TYPE_ANE, tolower) %in% c("spinal anesthesia", "spinal anesthesia / general anesthesia", "epidural anesthesia / general anesthesia", "general anesthesia")) | 245 |
# We list the cases that have been marked as inconsistent for each rule
library(kableExtra)
knitr::kable(profile$checks[, c(1,7)]) %>%
kable_styling() %>%
scroll_box(width = "100%")
| Variable | Error.list |
|---|---|
| DEL_MODE | 472,500,525,583,636,638,654,683,698,699,729,793,802,811,821,838,846,865,866,890,918,969,975,985,987,1008,1010,1013,1066,1076,1082,1129,1150,1190,1194,1241,1246,1270,1300,1367,1383,1384,1388,1402,1411,1428,1433,1445,1446,1453,1459,1501,1508,1531,1559,1575,1578,1612,1616,1635,1665,1667,1669,1674,1685,1692,1719,1753,1770,1774,1781,1784,1787,1790,1791,1800,1802,1836,1839,1840,1855,1857,1861,1874,1883,1902,1925,1928,1933,1939,1950,1963,1970,1972,1984,1987,2000,2001,2010,2049,2055,2058,2066,2102,2122,2134,2135,2145,2169,2251,2278,2292,2322,2334,2348,2368,2369,2386,2401,2409,2457,2474,2493,2518,2530,2550,2551,2563,2583,2591,2592,2623,2624,2632,2647,2676,2679,2680,2686,2706,2716,2732,2762,2767,2778,2822,2857,2865,2883,2901,2907,2920,2923,2934,2935,2943,2963,2965,2981,2985,3002,3006,3016,3018,3024,3029,3039,3073,3099,3106,3107,3119,3126,3158,3179,3183,3188,3201,3216,3221,3222,3240,3254,3256,3259,3323,3344,3356,3359,3363,3365,3380,3391,3410,3418,3427,3443,3484,3519,3531,3550,3553,3560,3575,3592,3603,3618,3622,3627,3630,3638,3648,3660,3670,3676,3677,3698,3704,3707,3761 |
| DEL_MODE | 58,60,66,157,165,167,182,224,251,256,270,301,322,323,347,350,375,382,392,395,428,463,473,484,492,495,503,508,512,521,524,526,534,538,539,543,552,556,572,576,585,592,597,601,604,607,616,617,618,633,635,640,641,643,652,658,660,663,669,677,680,684,688,689,696,702,707,710,711,712,714,717,733,740,742,748,759,761,763,767,768,770,776,779,782,791,795,797,804,814,828,836,839,842,844,847,857,874,878,882,888,892,901,903,910,911,914,917,919,925,926,930,931,935,941,946,961,971,977,998,1007,1017,1022,1023,1024,1025,1032,1033,1035,1048,1050,1051,1052,1053,1056,1060,1062,1067,1072,1078,1084,1086,1088,1095,1103,1110,1111,1124,1126,1131,1132,1133,1135,1137,1139,1140,1143,1144,1146,1148,1163,1171,1179,1184,1187,1188,1191,1197,1199,1210,1214,1216,1220,1224,1234,1237,1244,1250,1255,1257,1258,1268,1273,1278,1282,1283,1286,1291,1294,1295,1297,1299,1308,1321,1329,1337,1353,1355,1357,1360,1361,1369,1376,1378,1380,1382,1385,1395,1400,1412,1413,1414,1417,1421,1426,1434,1438,1440,1442,1450,1456,1461,1464,1465,1472,1473,1476,1477,1479,1481,1488,1496,1504,1512,1519,1522,1537,1565,1568,1581,1583,1600,1604,1610,3149 |
In applying the two consistency rules to the data set containing 3801 records, inconsistencies have been identified that require attention. The first rule states that if the mode of delivery is elective cesarean section, the anesthesia used should be epidural, epidural/general, spinal, spinal/general or general. According to the results, there are 220 cases that do not comply with this rule. The second rule indicates that if the mode of delivery is an emergency cesarean section, the anesthesia used should be spinal, spinal/general, epidural/general or general. In this case, 245 records were found that violate the rule.
These figures represent approximately 5.79% and 6.45% of the total records for the first and second rule, respectively. Overall, about 12% of the records present inconsistencies between the mode of delivery and the type of anesthesia administered. This percentage is significant and suggests the existence of problems in data quality or in the recording processes.
Addressing these inconsistencies will improve the overall quality of the data and increase the reliability of data-based analyses and decisions. Corrective actions, such as reviewing and correcting data, training staff in data recording, and updating consistency rules to more accurately reflect current clinical practices, are essential.
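The logic of the first rule can also be cross-checked in base R, independently of datacheck. A minimal sketch on a hypothetical mini-batch (the example values are assumptions, not taken from the real file):

```r
# Hypothetical mini-batch for rule 1:
# an elective caesarea requires one of the listed anesthesia types
toyDel <- data.frame(
  DEL_MODE     = c("Elective caesarea", "Elective caesarea", "Vaginal delivery"),
  DEL_TYPE_ANE = c("Epidural anesthesia", "No anesthesia", "No anesthesia"),
  stringsAsFactors = FALSE
)
allowedAne <- c("epidural anesthesia", "epidural anesthesia / general anesthesia",
                "spinal anesthesia", "spinal anesthesia / general anesthesia",
                "general anesthesia")
violations <- which(tolower(toyDel$DEL_MODE) == "elective caesarea" &
                    !(tolower(toyDel$DEL_TYPE_ANE) %in% allowedAne))
violations   # row 2 violates the rule
```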
We are going to analyze whether there are pattern changes in the data distributions over time. To do this we will use the EHRtemporalVariability library (Delineating Temporal Dataset Shifts in Electronic Health Records). First, we convert the case date, originally in text format, to the basic Date type:
repositoryFormatted <- EHRtemporalVariability::formatDate(
input = repository,
dateColumn = "BIRTH_DATE",
dateFormat = dateFormat
)
We obtain the temporal maps from the already formatted repository using the function estimateDataTemporalMap and selecting a monthly period. We can get the help of the function by typing ?estimateDataTemporalMap in the console (it is only necessary to pass the data, the column with the date, and the desired period, the rest is obtained automatically from the data).
probMaps <- estimateDataTemporalMap(
data = repositoryFormatted,
dateColumn = "BIRTH_DATE",
period = "month"
)
Next we obtain the information geometry plots from the previously estimated temporal maps. To do this we will use the function estimateIGTProjection on the list of temporal maps. We are going to save the results in a variable.
igtProjs <- sapply( probMaps, estimateIGTProjection )
names( igtProjs ) <- names( probMaps )
We can observe as an example the data temporal map and Information Geometric Temporal (IGT) plot of the variable type of anesthesia during delivery, DEL_TYPE_ANE:
plotDataTemporalMap(probMaps$DEL_TYPE_ANE)
div(plotIGTProjection(igtProjs$DEL_TYPE_ANE, trajectory = TRUE), align = "center")
In this example, we can see that over time there are changes or differences associated with synonymy of terms (No anesthesia, no, Without anesthesia), or even differences between upper and lower case (spinal, Spinal).
Next, we are going to save the results of the temporal variability estimation in a file, so that we can upload them to the web application EHRtemporalVariability, in the Load your RData tab.
save(probMaps, igtProjs, file = "variabilityData.RData")
Either using the web application, or obtaining the graphs in RStudio, we will study the temporal evolution of the type of delivery variable DEL_MODE, and discuss the results and their potential implications in the problem defined at the beginning of the practice on the monitoring of the early onset of lactation indicator, as well as propose a possible solution.
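Mirroring the DEL_TYPE_ANE example above, the corresponding plots for DEL_MODE can be obtained from the already estimated probMaps and igtProjs (this sketch assumes the previous chunks have been run and that DEL_MODE was included in the estimated temporal maps):

```r
# Data temporal map and IGT plot for the type of delivery variable
plotDataTemporalMap(probMaps$DEL_MODE)
div(plotIGTProjection(igtProjs$DEL_MODE, trajectory = TRUE), align = "center")
```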
The first graph, a temporal heat map, shows the probability distribution of the different types of anesthesia used during delivery (variable DEL_TYPE_ANE) over time. Abrupt changes in the use of certain terms to describe anesthesia are observed, reflecting problems of consistency in data entry. For example, at certain periods the term “no” is primarily used to indicate the absence of anesthesia, while at other times this term virtually disappears and is replaced by “Without anesthesia.” This alternation suggests a lack of standardization in the record, which could lead to misinterpretation of a change in clinical practice when, in fact, it is a terminological variation. Something similar occurs with the terms “spinal” and “Spinal,” “epidural” and “Epidural,” as well as with “local” and “Local,” where differences in capitalization generate apparent fluctuations in the use of these types of anesthesia over time.
The second graph, an Information Geometric Temporal (IGT) plot, shows three major groupings corresponding to different periods: the points from 2009, the points from 2010, and the points from 2011 through 2014. These groupings suggest that the characteristics of the anesthesia data remained relatively consistent within each of these time blocks, generating well-defined clusters in three-dimensional space.
In addition to these three main clusters, several isolated individual points are observed that lie far from both these clusters and each other. These isolated points, or outliers, represent specific periods in which anesthesia data distributions differ significantly from general trends. These outliers could reflect anomalies in data entry, one-off changes, or recording errors that caused them to diverge from the more consistent groupings.
Within each of the three main groups, there is no clear pattern of temporal variation that can be commented on in depth. This indicates that, within each time block, the practices or characteristics of the anesthesia data remained relatively homogeneous, with no progressive evolution or change that stands out. The absence of internal patterns within groups suggests that the differences observed between time blocks may be more related to systematic changes or inconsistencies in data entry than to a gradual evolution in anesthesia practices over time.
The maternal and infant data studied have been kindly provided by Ricardo García de León for educational purposes only. Their use or distribution outside this subject and practice is forbidden.